library(ThresholdR)
library(Seurat)
library(dplyr)
set.seed(42)
# we will install the 'bmcite' dataset from SeuratData::AvailableData()
#SeuratData::InstallData('bmcite')
#then load the data
#so <- SeuratData::LoadData('bmcite')
# Here we are loading the V4 seurat object of bmcite dataset:
so <- readRDS('bmcite_so.RDS')
so <- NormalizeData(so, assay = "ADT",normalization.method="CLR", margin=2)
## Normalizing across cells
# For Seurat V4 object:
adt <- as.data.frame(t(as.matrix(so@assays$ADT@data)))
# For Seurat V5 object:
#adt <- as.data.frame(t(as.matrix(LayerData(so, assay = "ADT",layer='data'))))
We assume that the maximum number of underlying populations (components) across all ADT markers is equal to 3 (trimodal). Then we will calculate BIC values which indicate how well a model is fitted. More positive BIC value (calculated using ‘mclust’ R package) indicates a better fitting:
Sys.time()
## [1] "2025-03-16 12:00:19 EDT"
bic.values <- calculateBIC(data = adt, k = 3)
## [1] "CD11a"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD11c"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD123"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD127-IL7Ra"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD14"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD16"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD161"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD19"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD197-CCR7"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD25"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD27"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD278-ICOS"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD28"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD3"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD34"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD38"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD4"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD45RA"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD45RO"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD56"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD57"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD69"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD79b"
## [1] 1
## [1] 2
## [1] 3
## [1] "CD8a"
## [1] 1
## [1] 2
## [1] 3
## [1] "HLA.DR"
## [1] 1
## [1] 2
## [1] 3
To choose the possible fits (k=2 and/or k=3) for each ADT, the user should assign a set of few markers which they believe to have bimodal or trimodal (non-unimodal) distributions. With this input, a margin is defined as the minimum value of BIC2/BIC1 divided by 3 across the set of user-provided markers. If BIC(k+1) is greater BIC(k) by at least that calculated margin in any ADT marker, k+1 is also included in the list of possible k number of components for the corresponding marker.
k.list <- get_k_list(bic_values = bic.values, markers = c("CD3", "CD14", "CD19", "CD45RA"))
## [1] "CD11a"
## [1] "CD11c"
## [1] "CD123"
## [1] "CD127-IL7Ra"
## [1] "CD14"
## [1] "CD16"
## [1] "CD161"
## [1] "CD19"
## [1] "CD197-CCR7"
## [1] "CD25"
## [1] "CD27"
## [1] "CD278-ICOS"
## [1] "CD28"
## [1] "CD3"
## [1] "CD34"
## [1] "CD38"
## [1] "CD4"
## [1] "CD45RA"
## [1] "CD45RO"
## [1] "CD56"
## [1] "CD57"
## [1] "CD69"
## [1] "CD79b"
## [1] "CD8a"
## [1] "HLA.DR"
# If the user wishes to include k=2 and k=3 for all markers in the data matrix, they could modify k_list as the following:
# k.list <- list()
# for (i in names(adt)) {
# k.list[[i]] <- c(2,3)
# }
In order to see if a marker is not meeting the threshold for any k, and is dropped out of the k_list, we can have a look:
names(bic.values)[!names(bic.values) %in% names(k.list)]
## character(0)
With the estimated list of possible k(s) for markers, we can run the fitting algorithm:
fittings <- fit_k(data = adt, k_list = k.list, margin.den = 0.1, epsilon = 0.01, maxit = 1000, maxrestarts = 10)
## CD11a: 2 modal: number of iterations= 12
## CD11c: 2 modal: number of iterations= 20
## CD11c: 3 modal: number of iterations= 79
## CD123: 2 modal: number of iterations= 30
## CD123: 3 modal: number of iterations= 44
## CD127-IL7Ra: 2 modal: number of iterations= 15
## CD14: 2 modal: number of iterations= 13
## CD14: 3 modal: number of iterations= 80
## CD16: 2 modal: number of iterations= 13
## CD16: 3 modal: number of iterations= 111
## CD161: 2 modal: number of iterations= 14
## CD19: 2 modal: number of iterations= 16
## CD19: 3 modal: number of iterations= 106
## CD197-CCR7: 2 modal: number of iterations= 19
## CD25: 2 modal: number of iterations= 30
## CD27: 2 modal: number of iterations= 18
## CD27: 3 modal: number of iterations= 49
## CD278-ICOS: 2 modal: number of iterations= 19
## CD278-ICOS: 3 modal: number of iterations= 79
## CD28: 2 modal: number of iterations= 14
## CD3: 2 modal: number of iterations= 7
## CD34: 2 modal: number of iterations= 20
## CD34: 3 modal: number of iterations= 115
## CD38: 2 modal: number of iterations= 46
## CD4: 2 modal: number of iterations= 13
## CD4: 3 modal: number of iterations= 32
## CD45RA: 2 modal: number of iterations= 25
## CD45RO: 2 modal: number of iterations= 19
## CD45RO: 3 modal: number of iterations= 166
## CD56: 2 modal: number of iterations= 12
## CD57: 2 modal: number of iterations= 17
## CD69: 2 modal: number of iterations= 16
## CD69: 3 modal: number of iterations= 127
## CD79b: 2 modal: number of iterations= 9
## CD8a: 2 modal: number of iterations= 17
## CD8a: 3 modal: number of iterations= 94
## HLA.DR: 2 modal: number of iterations= 28
## Fixing Sigma for C1 in CD11a -Bimodal# Fixing Sigma for C1 in CD11c -Bimodal# Fixing Sigma for C1 in CD11c -Trimodal# Fixing Sigma for C1 in CD127-IL7Ra -Bimodal# Fixing Sigma for C1 in CD27 -Bimodal# Fixing Sigma for C1 in CD27 -Trimodal# Fixing Sigma for C1 in CD28 -Bimodal# Fixing Sigma for C1 in CD3 -Bimodal# Fixing Sigma for C1 in CD4 -Bimodal# Fixing Sigma for C1 in CD4 -Trimodal# Fixing Sigma for C1 in CD45RA -Bimodal#
We then proceed to calculate different thresholds in bimodal and trimodal fittings:
thresholds <- get_thresholds(fittings = fittings, k_list = k.list)
publish_fittings_thr(fittings = fittings, k_list = k.list, thresholds = thresholds, path = '02Plots/', seed = 42)
Sys.time()
## [1] "2025-03-16 12:00:44 EDT"
The ThresholdR process is completed within 1 minute when running on a MacBook Pro M4 with 16 cores or 2-3 minutes in a linux environment with 1 CPU.
Save the threshold values for all markers:
# write.csv(thresholds, '03AllThresholds.csv')
We advise users to choose between suggested thresholds in the ‘03AllThresholds.csv’ file and make the ‘04All_Markers.csv’ file with the extra columns of ‘threshold_col’ and ‘threshold_value’. Then we can threshold the ADT data. This is done through subtracting the final threshold values stored in the ‘threshold_value’ column from each marker expression level in each cell. At the end, the expression levels below the thresholds are set to zero.
markers.thr <- read.csv('04All_Markers.csv')
thAb <- adt[,names(fittings)]
for (i in colnames(thAb)) {
thAb[,i] <- thAb[,i]-markers.thr[markers.thr$marker_name==i,"threshold_value"]
thAb[thAb[,i]<0,i] <- 0
}
# if you do not wish to select a threshold out of the suggested ones, you can use the bi_thr column value across the markers that have k=2 in their set of possible k(s) in within k_list:
# thAb <- adt[,names(fittings)]
# for (i in colnames(thAb)) {
# thAb[,i] <- thAb[,i]-markers.thr[markers.thr$marker_name==i,"bi_thr"]
# thAb[thAb[,i]<0,i] <- 0
# }
To annotate the 7 major cell types of B, CD4T, CD8T, classical (cMo), intermediate (iMo), and non-classical (nMo) monocytes, and NK cells, we use the thresholded values of the following set of main markers: CD3, CD4, CD8 (CD8a), CD14, CD16, CD19, and CD56:
so[["thAb"]] <- CreateAssayObject(data=t(as.matrix(thAb)))
DefaultAssay(so) <- "thAb"
so$annotation <- ifelse(colnames(so) %in% WhichCells(so, expression = CD3==0 & CD4==0 & CD8a==0 & CD14==0 & CD16==0 & CD19>0 & CD56==0), "B",
ifelse(colnames(so) %in% WhichCells(so, expression = CD3>0 & CD4>0 & CD8a==0 & CD14==0 & CD16==0 & CD19==0 & CD56==0), "CD4T",
ifelse(colnames(so) %in% WhichCells(so, expression = CD3>0 & CD4==0 & CD8a>0 & CD14==0 & CD19==0 & CD56==0), "CD8T",
ifelse(colnames(so) %in% WhichCells(so, expression = CD3==0 & CD14>0 & CD16==0 & CD19==0 & CD56==0), "cMo",
ifelse(colnames(so) %in% WhichCells(so, expression = CD3==0 & CD14>0 & CD16>0 & CD19==0 & CD56==0), "iMo",
ifelse(colnames(so) %in% WhichCells(so, expression = CD3==0 & CD14==0 & CD16>0 & CD19==0 & CD56==0), "nMo",
ifelse(colnames(so) %in% union(WhichCells(so, expression = CD3==0 & CD14==0 & CD16==0 & CD19==0 & CD56>0),
WhichCells(so, expression = CD3==0 & CD14==0 & CD16>0 & CD19==0 & CD56>0)), "NK",
"Rem")))))))
table(so$annotation)
##
## B CD4T CD8T cMo iMo NK nMo Rem
## 3584 7930 5909 6457 175 1650 365 4602
Here we propose using of ADT assay for the purpose of dimension reduction:
DefaultAssay(so) <- "ADT"
so <- so %>%
ScaleData(features = rownames(so)) %>%
RunPCA(features = rownames(so))
## Centering and scaling data matrix
## Warning in irlba(A = t(x = object), nv = npcs, ...): You're computing too large
## a percentage of total singular values, use a standard svd instead.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## Warning: Requested number is larger than the number of available items (25).
## Setting to 25.
## PC_ 1
## Positive: CD3, CD28, CD27, CD127-IL7Ra, CD278-ICOS, CD4, CD8a, CD161, CD25, CD45RO
## CD45RA, CD197-CCR7
## Negative: HLA.DR, CD11c, CD14, CD38, CD69, CD123, CD11a, CD34, CD19, CD79b
## CD16, CD56
## PC_ 2
## Positive: CD45RO, CD11a, CD14, CD4, CD11c, CD28, CD3, CD69, CD278-ICOS, CD127-IL7Ra
## CD38, CD161
## Negative: CD45RA, CD19, CD79b, CD197-CCR7, CD8a, CD57, CD56, CD16, CD27, CD25
## HLA.DR, CD34
## PC_ 3
## Positive: CD16, CD56, CD8a, CD11a, CD57, CD161, CD45RA, CD38, CD3, CD127-IL7Ra
## CD27, CD11c
## Negative: CD19, CD79b, CD4, CD25, HLA.DR, CD197-CCR7, CD69, CD34, CD28, CD45RO
## CD14, CD278-ICOS
## PC_ 4
## Positive: CD161, CD25, CD45RO, CD56, CD16, CD4, CD79b, CD19, CD28, CD57
## CD127-IL7Ra, HLA.DR
## Negative: CD8a, CD27, CD14, CD45RA, CD3, CD11c, CD69, CD38, CD197-CCR7, CD11a
## CD34, CD123
## PC_ 5
## Positive: CD4, CD38, CD123, CD16, CD34, CD45RA, CD56, CD278-ICOS, CD28, CD197-CCR7
## CD27, CD3
## Negative: CD45RO, CD8a, CD69, CD161, CD79b, CD19, CD25, CD57, CD14, CD11a
## CD127-IL7Ra, HLA.DR
ElbowPlot(so,ndims = 24)
Choosing the first 20 components from PCA:
so <- so %>%
FindNeighbors(dims = 1:20)%>%
FindClusters(resolution = 0.3, random.seed = 42) %>%
RunUMAP(dims = 1:20, spread = 1.5, min.dist = 0.5, seed.use = 42)
## Computing nearest neighbor graph
## Computing SNN
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 30672
## Number of edges: 1069725
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9569
## Number of communities: 15
## Elapsed time: 3 seconds
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 12:00:51 UMAP embedding parameters a = 0.3593 b = 1.149
## 12:00:51 Read 30672 rows and found 20 numeric columns
## 12:00:51 Using Annoy for neighbor search, n_neighbors = 30
## 12:00:51 Building Annoy index with metric = cosine, n_trees = 50
## 0% 10 20 30 40 50 60 70 80 90 100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 12:00:52 Writing NN index file to temp file /var/folders/5_/mvpjsxr157b_s5l6r66v67fh0000gp/T//RtmppzsaAJ/filef341664a3ba0
## 12:00:52 Searching Annoy index using 1 thread, search_k = 3000
## 12:00:56 Annoy recall = 100%
## 12:00:56 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 12:00:57 Initializing from normalized Laplacian + noise (using irlba)
## 12:00:58 Commencing optimization for 200 epochs, with 1349406 positive edges
## 12:01:05 Optimization finished
DimPlot(so, group.by = 'annotation')
Since we have not included CD34 in our main markers, the (CD34+) myeloid progenitors are annotated as the remaining (Rem) cell types. We will plot the denoised expression levels of the 7 main markers plus CD34 next:
library(ggplot2)
library(patchwork)
for (i in c("CD3","CD4","CD8a","CD14","CD16","CD19", "CD56", "CD34")) {
p.noisy <- FeaturePlot(so, features = paste0("adt_",i), slot = "data")+
ggtitle(label = paste("Noisy", i))
p.denoised <- FeaturePlot(so, features = paste0("thab_",i), slot = "data")+
ggtitle(label = paste("Denoised", i))
print(wrap_plots(p.noisy, p.denoised, ncol = 2))
}
To see how ThresholdR changed the noise across the major cell types, we exclude the cells with Rem label, and then compare the two dotplots:
DefaultAssay(so) <- "ADT"
print(DotPlot(object = subset(so, cells = WhichCells(so, expression = annotation!="Rem")), features = c("CD3","CD4","CD8a","CD14","CD16","CD19", "CD56"), group.by = 'annotation', dot.min = 0.05, scale = F)+
ggtitle(label = paste("Noisy ADT")))
DefaultAssay(so) <- "thAb"
print(DotPlot(object = subset(so, cells = WhichCells(so, expression = annotation!="Rem")), features = c("CD3","CD4","CD8a","CD14","CD16","CD19", "CD56"), group.by = 'annotation', dot.min = 0.05, scale = F)+
ggtitle(label = paste("Denoised thAb")))
## Warning: Removed 33 rows containing missing values or values outside the scale range
## (`geom_point()`).